1. Difference-in-Differences


1.1 Measuring the effect of a soda tax on sugar-added drink consumption

After the very succesful impact evaluations you have performed in the past weeks, you are contacted by the local government of Pawnee, Indiana. The city is interested in your advice to assess a policy intervention passed with the support of councilwoman Leslie Knope.

The city of Pawnee has been at the spotlight recently, as it has come to be known as the child obesity and diabetes capital of the state of Indiana. Some of the constituents of the city point at the fast food culture and soda sizes across the restaurants in town as a source of the problem. The largest food chain in Pawnee, Paunch Burger, offers its smallest soda size at a whopping 64oz (about 1.9 liters).

The “soda tax”, as it came to be known, came to effect initially at a couple of districts. Fortunately for you, based on an archaic law, residents of Indiana have to demonstrate their residence in the district they intend to dine before being served at any of the restaurants. The latter fact means that Pawnee inhabitants can only buy sugar-added drinks in their respective home district.

soda_size




1.2 Exploring the data

set.seed(42) #for consistent results

library(dplyr) # to wrangle our data
library(tidyr) # to wrangle our data - gather()
library(ggplot2) # to render our graphs
library(haven) # to load our .dta files for the assignment
library(plm) # to run our fixed effects models
library(kableExtra) # to render better formatted tables
library(stargazer) # for formatting your model output

soda_tax_df <- read.csv("./data/soda_tax_df.csv") # simulated data

names(soda_tax_df) # to check the names of the variables in our data
## [1] "id"        "district"  "treatment" "pre_tax"   "post_tax"  "change"


Our dataset soda_tax_df, contains the following information:

  • ìd: A unique number identifier for each of the 7,500 inhabitants of Pawnee
  • district: The name of the district in which the corresponding unit lives
  • treatment: A binary variable that signals whether the subject lived in a district where the tax was implemented
  • pre_tax: The weekly sugar-added drink consumption in ounces before the tax was imposed
  • post_tax: The weekly sugar-added drink consuption in ounces after the tax was imposed
  • change: The difference in sugar-added drink consuption between post- and pre-tax


Let’s check how many units are there in every district and which districts have treated units

We can use the base R table() function, which performs categorical tabulation of data with the variable and its frequency, to check how many people live on each district.

table(soda_tax_df$district) %>% 
  kable() %>% # create kable table
  kable_styling() # view kable table
Var1 Freq
City Hall 1250
Old Eagleton 1250
River Park 1250
Snake Lounge 1250
The Pit 1250
Wamapoke Quarter 1250

Even more, we can create a two-way cross-table to check how many treated units live in each district.

table(soda_tax_df$treatment, soda_tax_df$district) %>% 
  kable() %>% # create kable table
  kable_styling() # view kable table
City Hall Old Eagleton River Park Snake Lounge The Pit Wamapoke Quarter
0 0 0 0 1250 1250 1250
1 1250 1250 1250 0 0 0


Let’s check the distribution of sugar-added drinks consumption at different points in time

We can also utilize summary() to check the distribution of our variables. In this case, we can look at the distribution of our variable of interest at the two points in time:

Pre-tax consumption levels
summary(soda_tax_df$pre_tax)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   400.6   506.6   509.7   610.8  1687.6

Post-tax consumption levels
summary(soda_tax_df$post_tax)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   327.0   445.9   449.8   566.1  1705.6





1.3 Modeling and estimating

So far we have ignored time in our estimations. Up until this point, most of the tools we have learnt produce estimates of the counterfactual through explicit assignment rules that work random or as-if-random (e.g. randomized experimental, regression discountinuity, and instrumental set-ups).

Difference-in-differences compares the changes in outcomes over time between units under different treatment states. This allows us to correct for any differences between the treatment and comparison groups that are constant over time assuming that the trends in time are parallel.

a. Calculating without time

If we did not have the pre_tax baseline measure, we would likely utilize the post_tax to explore the average effect on the treated. In this case, we would model this as:

after_model <- lm(post_tax ~ treatment, data = soda_tax_df)
stargazer(after_model, type = "html")
Dependent variable:
post_tax
treatment -146.918***
(3.798)
Constant 523.273***
(2.686)
Observations 7,500
R2 0.166
Adjusted R2 0.166
Residual Std. Error 164.465 (df = 7498)
F Statistic 1,496.245*** (df = 1; 7498)
Note: p<0.1; p<0.05; p<0.01

We could read this result substantively as: those who lived in districts were the tax was implemented consumed on average 146.9 ounces less of sugar-added drinks per week compared to those who lived in districts were the tax was not put in place. This calculation would give us a comparison of the treatment and control groups after treatment.

To believe the results of our after_model, we would need to believe that the mean ignorability of treatment assignment assumption is fulfilled. We would have to think carefully about possible factors that could differentiate our treatment and control groups. We use a treatment indicator based on the districts where the measure was able to be implemented. Treatment was not fully randomly assigned, so there may be lots of potential confounders that create baseline differences in the scores for those living in Old Eagleton compared to those in Snake Lounge, which also affect the after-treatment comparisons.



b. Including the time dimension

We can introduce the time component to our calculation by incorporating the pre-treatment levels of sugar-added drink consumption, which gives us the diff-in-diff estimand. We could calculate this in a fairly straightforward manner:

did_model <- lm(change ~ treatment, data = soda_tax_df)
stargazer(did_model, type = "html")
Dependent variable:
change
treatment -149.744***
(0.246)
Constant 14.967***
(0.174)
Observations 7,500
R2 0.980
Adjusted R2 0.980
Residual Std. Error 10.671 (df = 7498)
F Statistic 369,242.400*** (df = 1; 7498)
Note: p<0.1; p<0.05; p<0.01

We could read this result substantively as: those who lived in districts were the tax was implemented consumed on average 149.7 ounces less of sugar-added drinks per week compared to those who lived in districts were the tax was not put in place. This calculation would give us the change, or difference, in sugar-added drink consumption for treatment and control groups.

To believe the results of our did_model, we would need to believe that there are parallel trends between the two groups.


Note: when simulating the data the post_tax was defined as: \(post_tax = 15 + pre\_tax - 150(treatment) + error\)




c. Looking under the hood

Remember the formula from the lecture where we estimate the diff-in-diff effect with time and treatment dummies? We can re-format our data to gather our diff-in-diff estimand

\[Y_{it} = β_0 + β_1D^*_i + β_2P_t + β_{DD}D^∗_i × P_t + q_{it} \]

where \(D^*_i\) tell us if subject \(i\) is in the treatment group and \(P_t\) indicates the point in time (1 for post)

We need to convert our data to a long format to render the time and treatment dummy variables. This is how our data looks like:

head(soda_tax_df, 6)
##   id     district treatment   pre_tax  post_tax    change
## 1  1 Snake Lounge         0 1687.6438 1705.5791 17.935292
## 2  2 Snake Lounge         0  427.2953  438.2526 10.957361
## 3  3 Snake Lounge         0  566.4693  559.6664 -6.802863
## 4  4 Snake Lounge         0  606.9294  623.9057 16.976316
## 5  5 Snake Lounge         0  572.6402  606.8654 34.225177
## 6  6 Snake Lounge         0  496.0813  501.8001  5.718800

We will utilize the pivot_longer() function from tidyr to format our data frame.

soda_tax_df_long <- soda_tax_df %>%
  select(id, pre_tax, post_tax, treatment) %>% # select the columns we are interested in
  pivot_longer(cols = c(pre_tax, post_tax), names_to = "period", values_to = "soda_drank") %>% # grab columns, put names to new variable period and values to new variable soda_drank
  mutate(after_tax = if_else(period == "post_tax", 1, 0)) # create dummy for period

head(soda_tax_df_long, 6)
## # A tibble: 6 x 5
##      id treatment period   soda_drank after_tax
##   <int>     <int> <chr>         <dbl>     <dbl>
## 1     1         0 pre_tax       1688.         0
## 2     1         0 post_tax      1706.         1
## 3     2         0 pre_tax        427.         0
## 4     2         0 post_tax       438.         1
## 5     3         0 pre_tax        566.         0
## 6     3         0 post_tax       560.         1

We can see that under our long format, we have two entries for every individual. We have our variable after_tax, which represents \(P_t\), where 0 and 1 are pre- and post-tax periods respectively. We can now render our regression based on the formula:

\[Y_{it} = β_0 + β_1D^*_i + β_2P_t + β_{DD}D^∗_i × P_t + q_{it}\]

did_long <- lm(soda_drank ~ treatment + after_tax + treatment * after_tax, data = soda_tax_df_long)

We can think about the results of this regression in terms of this table, where \(\beta{DD}\) is the difference in differences in group means:

dif_in_dif

stargazer(did_long, type = "html")
Dependent variable:
soda_drank
treatment 2.827
(3.799)
after_tax 14.967***
(3.799)
treatment:after_tax -149.744***
(5.373)
Constant 508.306***
(2.686)
Observations 15,000
R2 0.117
Adjusted R2 0.117
Residual Std. Error 164.502 (df = 14996)
F Statistic 664.480*** (df = 3; 14996)
Note: p<0.1; p<0.05; p<0.01

If we go back to look our did_model, we can notice that by regressing the results we got at that stage were our \(\beta_2\) and \(\beta{DD}\).


1.4 Drafting some brief recommedations

Based on your analysis of the data at hand, you decide to recommend that the tax measure should move forward in the rest of Pawnee. You state that it is a very good example of a pigouvian tax, which captures the negative externalities not included in the market price of sugar-added drinks. The findings suggest that the tax reduced the weekly sugar-added drink consumption by about 150 luquid ounces (almost 4.5 liters).

Your evaluation report is so convincing that the Director of the Parks Department, Ron Swanson, is even doubting libertarianism.






2. Fixed-Effects


2.1 Measuring the effect of a carbon tax on national carbon dioxide emissions per capita


You are hired as an outside consultant by the Organization of Economic Non-Cooperation for Development (OENCD), they are interested in studying the effect of a carbon tax on national carbon dioxide emissions per capita. You are provided with data for the twenty-members of the organization from 2009 to 2019. The data can be called a true-panel based on the description given in the lecture



2.2 Exploring the data

carbon_tax_df <- read.csv("./data/carbon_tax_df.csv") # simulated data
names(carbon_tax_df) # to check the names of the variables in our data
## [1] "country_name"   "country_code"   "year"           "tax"           
## [5] "income_class"   "co2_per_capita"


Our dataset carbon_tax_df, contains the following information:

  • country_name: Name of the country
  • country_code: Three-letter country code
  • year: Year
  • tax: Dummy for whether the carbon tax was in place
  • income_class: Categorical variable with income label (Low to High)
  • co2_per_capita: carbon dioxide emissions per capita in metric tons (T)


Let’s explore who had the tax in place at what point

We can use what we have learnt about the base R table() function, to check how many countries had a tax in place every year.

table(carbon_tax_df$tax, carbon_tax_df$year) %>% 
  kable() %>% # create kable table
  kable_styling() # view kable table
2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
0 12 12 10 9 8 8 5 4 4 4 5
1 0 0 2 3 4 4 7 8 8 8 7



Let’s explore visually the levels of carbon dioxide emmissions

ggplot(carbon_tax_df, aes(x = year,
                          y= co2_per_capita, 
                          color = factor(tax))) +
  geom_point() + #create scatterplot
  theme_minimal() + 
  theme(legend.position="bottom") + #move legend to the bottom
  labs(title = "Exploratory plot of CO2 emissions per capita",
       x = "Year",
       y = "CO2 emissions in metric tons (T)",
       color = "Carbon tax") +
  scale_colour_discrete(labels = c("No", "In place")) #change labels of the legend

WHAT DO WE SEE HERE?

Note: The exaploratoring our data portions of this lab script will not be directly applicable to this week’s assignment. Still, exploratory data analysis will be very useful for the extension portion of your final replication paper. Summarizing, graphing, and exploring your data will be critical to discover patterns, to spot anomalies, and to check assumptions



2.3 Modeling and estimating

We have seen during the lecture that true panel data can help us decompose the error term. With a true-panel we can capture all unobserved, unit- and time-specific factors.

In the example at hand, we can think of unit-specific factors as characteristics of individual countries that are constant over time (e.g. a country that just loves big pick-up trucks). We can also think about time-specific factors that affect all countries (e.g. global economic shocks).

We can formulate this as:

\[Y_{it} = β_0 + β_1D_it + \nu_{i} + θ_t + η_{it}\] where \(\nu_i\) reflects the time-invariant traits of the units \(\theta_t\) reflects the time-specific factors that affect everyone and \(\eta_{it}\) is the idiosyncratic error
We will move forward by creating three models:

  • A naive model, where we will regress co2_per_capita on tax.
  • A model with unit-fixed effects, where we will capture the \(\theta\) portion of our error
  • A model with time- and unit-fixed effects, where we will capture our \(\nu\) and \(\theta\) portions of our error term





2.3.1 Naive modeling

naive_carbon <- lm(co2_per_capita ~ tax, data = carbon_tax_df)

stargazer(naive_carbon, type = "html")
Dependent variable:
co2_per_capita
tax -6.261***
(0.566)
Constant 10.627***
(0.352)
Observations 132
R2 0.485
Adjusted R2 0.481
Residual Std. Error 3.166 (df = 130)
F Statistic 122.394*** (df = 1; 130)
Note: p<0.1; p<0.05; p<0.01

This model is telling us that on average, the \(CO_2\) emmissions per capita are reduced by 6.2 metric tons when a carbon tax is put in place. Still, after all the work we have done throughout the semester, we understand that there may be a plethora of factors that could be skewing the results of this bivarate regression.




2.3.2 Unit-fixed effects

We will learn two ways of gathering unit- and time-fixed effects in R. First, we will perform Least Squares Dummy Variables (LSDV) estimation with lm(), where we essentially get an individual estimate for each unit. Second, we will run our model with plm(), which will do the same mechanics, yet it will not render each of the units intercept.

lsdv_unit_fe <- lm(co2_per_capita ~ tax + country_name, data = carbon_tax_df)
unit_fe <- plm(co2_per_capita ~ tax, data = carbon_tax_df, index = c("country_name"), model = "within")


Unit-fixed effects with lm() — Least Squares Dummy Variables (LSDV) estimation

stargazer(lsdv_unit_fe, type = "html")
Dependent variable:
co2_per_capita
tax -4.436***
(0.474)
country_nameBorovia 1.507
(0.912)
country_nameCarpania 5.106***
(0.912)
country_nameCorinthia 6.434***
(0.887)
country_nameFreedonia 6.120***
(0.903)
country_nameGlenraven 0.372
(0.951)
country_nameKhemed -0.672
(0.951)
country_nameLaurania 1.533
(0.937)
country_nameParano 3.693***
(0.887)
country_nameRon 2.715***
(0.912)
country_nameRumekistan 7.431***
(0.887)
country_nameTransia 1.883*
(0.968)
Constant 6.912***
(0.627)
Observations 132
R2 0.797
Adjusted R2 0.776
Residual Std. Error 2.079 (df = 119)
F Statistic 38.844*** (df = 12; 119)
Note: p<0.1; p<0.05; p<0.01


Unit-fixed effects with plm()

stargazer(unit_fe, type = "html")
Dependent variable:
co2_per_capita
tax -4.436***
(0.474)
Observations 132
R2 0.424
Adjusted R2 0.366
F Statistic 87.716*** (df = 1; 119)
Note: p<0.1; p<0.05; p<0.01

WHAT DO THESE RESULTS TELL US?

Adding unit-level fixed effects to the model, i.e. accounting for unobserved, time-invariant characteristics of countries and only focusing on within-state variation, an increase the imposition of a carbon tax reduces \(CO_2\) per capita emissions by 4.44 metric tons. Once we have captured the variation amongst countries, we can see that our results from the naive model were substantially biased. We can still try to capture the time-specific portion of the error.

The results from the Least Squares Dummy Variables (LSDV) estimation are read in reference to a baseline. In this case, the constant is representing the intercept for Adjikistan. We can utilize the individual slopes for each country to say that Freedonians emit on average 5.93 more metric tons of \(CO_2\) per capita than Adjikistanians.




2.3.3 Unit- and time-fixed effects

We will perform our regressions with Least Squares Dummy Variables (LSDV) estimation with lm() and our simplified way with plm().

lsdv_unit_time_fe <- lm(co2_per_capita ~ tax + country_name + factor(year), data = carbon_tax_df)
unit_time_fe <- plm(co2_per_capita ~ tax, data = carbon_tax_df, index = c("country_name", "year"),  model = "within", effect = "twoways")


Unit- and time-fixed effects with lm() — Least Squares Dummy Variables (LSDV) estimation

stargazer(lsdv_unit_time_fe, type = "html")
Dependent variable:
co2_per_capita
tax -3.908***
(0.280)
country_nameBorovia 1.267***
(0.418)
country_nameCarpania 4.866***
(0.418)
country_nameCorinthia 6.434***
(0.398)
country_nameFreedonia 5.928***
(0.410)
country_nameGlenraven -0.012
(0.447)
country_nameKhemed -1.056**
(0.447)
country_nameLaurania 1.196***
(0.436)
country_nameParano 3.693***
(0.398)
country_nameRon 2.474***
(0.418)
country_nameRumekistan 7.431***
(0.398)
country_nameTransia 1.450***
(0.459)
factor(year)2010 -4.697***
(0.381)
factor(year)2011 1.890***
(0.384)
factor(year)2012 -3.061***
(0.387)
factor(year)2013 0.106
(0.392)
factor(year)2014 -1.878***
(0.392)
factor(year)2015 -3.018***
(0.414)
factor(year)2016 -3.292***
(0.424)
factor(year)2017 -0.963**
(0.424)
factor(year)2018 -2.488***
(0.424)
factor(year)2019 -1.399***
(0.414)
Constant 8.621***
(0.396)
Observations 132
R2 0.963
Adjusted R2 0.955
Residual Std. Error 0.933 (df = 109)
F Statistic 127.305*** (df = 22; 109)
Note: p<0.1; p<0.05; p<0.01


Unit-fixed effects with plm()

stargazer(unit_time_fe, type = "html")
Dependent variable:
co2_per_capita
tax -3.908***
(0.280)
Observations 132
R2 0.641
Adjusted R2 0.568
F Statistic 194.229*** (df = 1; 109)
Note: p<0.1; p<0.05; p<0.01

WHAT DO THESE RESULTS TELL US?

Now in addition to adding unit-level fixed effects to the model, we control for time-specific factors that affect the global per capita \(CO_2\) emissions. The results suggest that the effect of a carbon-tax leads to a decrease \(CO_2\) emissions of 3.91 metric tons per capita.




2.4 Drafting some brief recommedations

You report back to the Organization of Economic Non-Cooperation for Development (OENCD). Based on your analysis of the data at hand, you suggest that the implementation of a carbon tax does have an effect on national carbon dioxide emissions per capita. Your results show that a carbon tax reduces \(CO_2\) emissions by 3.91 metric tons per capita.

Your results are welcomed binternationally and all states move forward with the measure.

environment_peace